Compute Capability (CC) acts as the versioning bridge between the virtual architecture (PTX) and the real architecture (SASS/binary). Developers use nvcc to target specific platforms, ranging from desktop and server GPUs to embedded platforms, across OS data models such as Linux 64-bit (LP64) and Windows 64-bit (LLP64).
1. Virtual vs. Real Architectures
The CUDA Toolkit supports offline compilation for GPU architectures spanning the last two major releases; Table 29: Feature Support across Compute Capabilities (7.5 to 12.x) lists the feature matrix. Virtual-to-real mappings are specified with flags such as nvcc --generate-code arch=compute_80,code=sm_90 prog.cu, which compiles compute_80 PTX down to sm_90 SASS. Shorthand targets like nvcc -arch=sm_100 are also available, as are architecture-specific variants such as nvcc -arch=sm_100a, which unlock features tied to that exact chip at the cost of forward compatibility.
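A common packaging pattern (a hedged Makefile sketch using nvcc's --generate-code syntax; the architecture choices are illustrative) embeds PTX for JIT-based forward compatibility alongside native SASS for known targets:

```
# Sketch: embed compute_80 PTX (JIT fallback on future GPUs)
# plus native SASS for sm_80 and sm_90 in one fatbinary.
NVCCFLAGS = --generate-code arch=compute_80,code=compute_80 \
            --generate-code arch=compute_80,code=sm_80 \
            --generate-code arch=compute_90,code=sm_90
```

At load time the driver picks the matching SASS if present, otherwise JIT-compiles the embedded PTX for the running GPU.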
2. The Macro Hierarchy
The compiler uses __CUDA_ARCH__ to branch code. The macro __CUDA_ARCH__ is defined only during device compilation passes (e.g., inside __device__ and __global__ functions), and its value encodes the target CC as major × 100 + minor × 10. More granular control is provided by __CUDA_ARCH_SPECIFIC__ and __CUDA_ARCH_FAMILY_SPECIFIC__. Certain features carry minimum CC requirements: Distributed Shared Memory, for example, requires Compute Capability 9.0 or later, while others (such as specific NaN payload behavior) are gated to Compute Capability 10.0 and later.
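The branching pattern above can be sketched as follows (a minimal illustration; the kernel name and fallback path are hypothetical):

```
__global__ void arch_probe(int *out) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // CC 9.0+ (Hopper and later) device pass: features such as
    // Distributed Shared Memory are legal to reference here.
    *out = __CUDA_ARCH__;   // e.g. 900, encoding major*100 + minor*10
#elif defined(__CUDA_ARCH__)
    // Older device pass: take a fallback path.
    *out = __CUDA_ARCH__;
#else
    // Host pass: __CUDA_ARCH__ is undefined, so guarded device-only
    // constructs must not leak into this branch.
#endif
}
```

Because nvcc compiles the same source once per target plus once for the host, every branch must be syntactically valid in every pass; the macro only selects which body survives preprocessing.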
3. Numerical Limits & Constraints
Precision guarantees vary by CC. On the host side, 80-bit extended precision (long double on LP64 Linux) has a minimum positive normal of $2^{-16382} \approx 3.36 \cdot 10^{-4932}$, with subnormal handling extending the representable range below that. Hardware and driver limits, such as the CUDA_DEVICE_MAX_COPY_CONNECTIONS environment variable (capped at 16) or the .maxnreg PTX directive, are strictly enforced based on the target CC version.